
fix: raft HA production hardening — leader fencing, log compaction, election timeout, audit log#3230

Merged
auricom merged 29 commits into main from fix/3229-raft-re-election
Apr 20, 2026

Conversation

@auricom
Contributor

@auricom auricom commented Apr 7, 2026

Summary

Hardens Raft HA behavior across several failure modes observed in SIGTERM cycling tests.

Leader fencing on SIGTERM

  • Added ResignLeader(ctx context.Context) error to raft.Node and FullNode (via LeaderResigner interface)
  • SIGTERM handler in run_node.go calls resigner.ResignLeader(resignCtx) directly with a 3s deadline — no wrapper goroutine needed; context.DeadlineExceeded vs other errors are logged separately
  • ResignLeader runs leadershipTransfer in a goroutine and selects on the result vs ctx.Done(), so the caller is unblocked at the deadline (the inner transfer exits on its own, bounded by ElectionTimeout)
  • ShutdownTimeout field (default 5s) added to raft.Config so Stop() drains committed logs with a meaningful timeout

Raft log compaction

  • Added SnapshotThreshold (default 500) and TrailingLogs (default 200) to RaftConfig and wired them into the hashicorp/raft config — previously using library defaults (8192 / 10240, i.e. snapshots every ~2.25h at 1 blk/s)
  • Fixed SnapCount default 0 → 3 (0 caused no snapshot files to be retained on disk)

Election timeout config

  • Added ElectionTimeout to RaftConfig (default 1s)
  • Validate() rejects negative values explicitly (< 0 is invalid; 0 remains valid as "use library default"); also validates >= HeartbeatTimeout when non-zero to prevent hashicorp/raft panicking at startup

Follower crash on restart (EVM ahead of stale snapshot)

  • RecoverFromRaft now verifies the local block hash at raftState.Height; a hash match skips unnecessary recovery safely, a mismatch returns an explicit error indicating a fork
  • waitForMsgsLanded replaced its repeating deadline ticker with a one-shot time.NewTimer and adds a final check in the deadline branch — prevents a race where both poll and deadline tickers fired simultaneously and timeout won even when AppliedIndex >= CommitIndex

FSM data race

  • FSM.Apply and SetApplyCallback are now guarded by applyMu sync.RWMutex to prevent the raft library calling Apply concurrently while SetApplyCallback(nil) clears the channel pointer during shutdown

BoltDB log noise

  • A one-time stdlib log filter (sync.Once in NewNode) silently drops "tx closed" lines from raft-boltdb — these are emitted on every successful commit due to defer tx.Rollback() and have no actionable meaning
  • Filter wraps log.Writer() instead of hard-coding os.Stderr so any existing stdlib logger redirection is preserved

Block provenance audit log

  • RaftApplyMsg carries a Term field; FSM.Apply() logs raft_term and raft_index alongside block height and hash for replication debugging

Test plan

  • go test ./pkg/raft/... ./pkg/config/... ./node/... -count=1 — all pass
  • TestNodeResignLeader_NilNoop / TestNodeResignLeader_NotLeaderNoop — nil/non-leader no-op paths
  • TestFullNode_ResignLeader_Noop — table-driven nil and non-leader cases
  • TestNewNode_SnapshotConfigApplied — snapshot config wiring (table-driven)
  • TestSyncerStop_CallsRaftRetrieverStop — Syncer.Stop() calls raftRetriever.Stop()
  • TestRecoverFromRaft_GetHeaderFailure — hash mismatch error path
  • "negative election timeout rejected" — Validate() rejects ElectionTimeout < 0
  • HA cycling test: verify leader fencing reduces unconfirmed-block window on SIGTERM
  • HA cycling test: verify rebooted nodes resync within cycle window after snapshot config change

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Best-effort leadership resignation on shutdown for smoother cluster handoff
  • Bug Fixes

    • Improved shutdown cleanup to avoid lingering apply callbacks
    • Safer recovery when local state is ahead of stale snapshots
    • Leaders now wait or abdicate when local store lags raft
  • Configuration

    • New Raft tuning options with sensible defaults (election timeout, snapshot threshold, trailing logs)
  • Tests

    • Added unit tests covering resignation, recovery, and election wait behavior

@coderabbitai
Contributor

coderabbitai bot commented Apr 7, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds context-aware Raft leader resignation, extends Raft configuration (election/snapshot/retention/timeouts), hardens FSM apply concurrency, improves recovery when local state is ahead of Raft, introduces leader abdication when local store lags Raft, and updates shutdown sequencing to detach apply callbacks.

Changes

Cohort / File(s) Summary
Raft Core Config & Node
pkg/raft/node.go, pkg/raft/types.go, pkg/raft/node_test.go
Adds ShutdownTimeout, ElectionTimeout, SnapshotThreshold, TrailingLogs; new ResignLeader(ctx) method; buildRaftConfig helper applies tuning and routes hclog -> zerolog; RaftApplyMsg gains Term; FSM apply callback access is guarded by an RWMutex; tests for config/ResignLeader added.
Raft Election Logic
pkg/raft/election.go, pkg/raft/election_test.go
New wait-for-store-sync polling when store lags raft by >1 block; attempts leadership transfer and aborts leader start if sync never completes; tests for abdication and catch-up scenarios added.
Block Syncing & Retriever Shutdown
block/internal/syncing/syncer.go, block/internal/syncing/raft_retriever.go, block/internal/syncing/syncer_test.go
Syncer.Stop() now stops raftRetriever; raftRetriever.Stop() clears raft apply callback after waiting for apply loop; RecoverFromRaft now verifies local header hash when local height > raft snapshot height and conditionally skips recovery; tests for these behaviors added.
Full Node & Interface
node/full.go, node/node.go, node/full_node_test.go
Adds LeaderResigner interface (ResignLeader(ctx) error), FullNode.ResignLeader implementation and compile-time assertion; noop tests added.
Startup/Shutdown Orchestration
pkg/cmd/run_node.go
On SIGINT/SIGTERM, attempts best-effort leader resignation via LeaderResigner with 3s timeout before cancelling worker context; logs outcomes.
Configuration Management & Defaults
pkg/config/config.go, pkg/config/defaults.go, pkg/config/config_test.go
Adds Raft config fields and CLI flags: ElectionTimeout, SnapshotThreshold, TrailingLogs; validates ElectionTimeout >= HeartbeatTimeout when >0; sets defaults (ElectionTimeout: 1000ms, SnapshotThreshold: 500, TrailingLogs: 200, SnapCount: 3) and updates tests.
Misc: Changelog & deps
CHANGELOG.md, go.mod
Changelog entry for Raft HA hardening; adds github.com/hashicorp/go-hclog direct require.

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI (SIGINT/SIGTERM)
    participant Runner as Run Control
    participant Node as FullNode
    participant RaftNode as RaftNode
    participant FSM as FSM/Apply

    CLI->>Runner: Signal received
    Runner->>Node: type-assert LeaderResigner
    alt supports ResignLeader
        Runner->>Node: ResignLeader(ctx, 3s)
        Node->>RaftNode: ResignLeader(ctx)
        RaftNode->>RaftNode: Attempt leadership transfer
        RaftNode-->>Node: Success/Error
        Node-->>Runner: return result
        Runner->>Runner: Log outcome
    else not supported
        Runner->>Runner: Skip resignation
    end
    Runner->>Runner: cancel worker context
    Runner->>Runner: wait for shutdown (10s)
sequenceDiagram
    participant Election as Election Loop
    participant Store as Local Store
    participant Transfer as leadershipTransfer()
    participant Runnable as Leader Runnable

    Election->>Store: IsSynced(raftState)
    Store-->>Election: diff = -5 (lag > 1)
    alt store significantly behind
        Election->>Election: log warning and poll IsSynced until timeout
        Election->>Transfer: leadershipTransfer()
        Transfer-->>Election: transfer initiated
        Election->>Election: abort leader start
    else store catches up
        Election->>Runnable: proceed with leader startup
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • alpe
  • chatton
  • tac0turtle

Poem

🐰
I nibbled at the Raft logs late at night,
Tuned timeouts, handed crowns with gentle might,
Callbacks untangled, snapshots set to keep,
Graceful goodbyes before the node sleeps deep. 🥕✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 26.09%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Title check (✅ Passed): the PR title clearly and specifically describes the main changes: raft HA production hardening with focus on leader fencing, log compaction, election timeout, and audit log.
  • Description check (✅ Passed): the PR description is comprehensive and well-structured, covering all major changes with clear sections, test plans, and sufficient context to understand the goals and implementation details.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@claude
Contributor

claude bot commented Apr 7, 2026

Claude finished @auricom's task in 5m 27s


Code Review

  • Read AGENTS.md and CLAUDE.md for repo-specific guidelines
  • Read changed files: election.go, node.go, run_node.go, config.go, syncer.go, full.go, raft_retriever.go, types.go
  • Analyze code quality, concurrency, error handling, and testing
  • Post comprehensive review

This is a solid and well-reasoned set of changes. The PR description and inline comments demonstrate clear understanding of the failure modes. Overall the implementation is production-ready with a few remaining items to address.


Critical / Must-fix

1. waitForBlockStoreSync — panic if pollInterval == 0 [pkg/raft/election.go:319]

pollInterval := min(100*time.Millisecond, timeout/10)
ticker := time.NewTicker(pollInterval)  // panics if pollInterval == 0

timeout falls back to 5 * cfg.SendTimeout, but if SendTimeout is zero (e.g., a raw Config{} in a test or integration scenario that bypasses Validate()), timeout remains 0 and NewTicker(0) panics. Validate() prevents this in production, but the function isn't defensive enough for direct callers. Add a floor:

pollInterval := max(time.Millisecond, min(100*time.Millisecond, timeout/10))



2. RaftConfig.Validate() does not reject SnapCount == 0 [pkg/config/config.go:419]

The default was correctly changed from 0 → 3, but Validate() doesn't guard against an operator explicitly setting snap_count: 0. raft.NewFileSnapshotStore(dir, 0, ...) will fail at node startup, not at config validation — a poor operator experience. Add:

if c.SnapCount == 0 {
    multiErr = errors.Join(multiErr, fmt.Errorf("snap_count must be > 0"))
}



Major / Should-fix

3. IsSynced error silently discarded on deadline path [pkg/raft/election.go:327-331]

case <-deadline.C:
    diff, err := r.IsSynced(d.node.GetState())
    if err == nil && diff >= -1 {
        return syncResultSynced
    }
    return syncResultTimeout  // err is dropped here

When IsSynced returns an error on the final check, syncResultTimeout is returned and the error disappears from logs entirely. Leadership transfer then follows. This error could contain critical diagnostic information (e.g., store unavailable). Log it before returning:

case <-deadline.C:
    diff, err := r.IsSynced(d.node.GetState())
    if err != nil {
        d.logger.Warn().Err(err).Msg("final IsSynced check failed during block-store sync wait")
    } else if diff >= -1 {
        return syncResultSynced
    }
    return syncResultTimeout


4. Leadership transfer failure exits Run() entirely [pkg/raft/election.go:161-163]

When syncResultTimeout is returned and leadershipTransfer() fails, Run() returns an error, terminating the entire election loop. This is a terminal event at the DynamicLeaderElection level. The current log at line 162 says Error but the return at 163 propagates the error upward. Make sure callers handle this as a fatal event and surface it clearly. This is probably fine as-is, but deserves a brief restart-required comment for operational clarity.


Minor / Nitpick

5. Duplicate log message in Run() [pkg/raft/election.go:124 and ~187]

Line 124 logs "became leader, stopping follower operations" at the start of the isStarted block. Line 187 logs the same string again, right before workerCancel(). The second one is redundant — consider changing it to "stopping follower worker before starting leader" to distinguish the two phases in traces.


6. os.Stderr passed to TCP transport and snapshot store [pkg/raft/node.go:117,127]

snapshotStore, err := raft.NewFileSnapshotStore(cfg.RaftDir, int(cfg.SnapCount), os.Stderr)
transport, err := raft.NewTCPTransport(cfg.RaftAddr, addr, 3, 10*time.Second, os.Stderr)

The PR adds a proper hclog → zerolog bridge for raftConfig.Logger but leaves these two sinks writing raw to os.Stderr. These could accept logger.With().Str("component","raft-snapshot").Logger() and logger.With().Str("component","raft-transport").Logger() respectively (zerolog.Logger implements io.Writer). Not introduced by this PR, but a missed opportunity while node.go was already touched.


Concurrency / Safety Review

FSM apply mutex (applyMu sync.RWMutex): Correctly introduced. SetApplyCallback holds the write lock; Apply holds the read lock only while copying the channel pointer, then releases before sending. No channel send happens under lock. Clean.

raftRetriever.Stop() ordering: cancel() → wg.Wait() (loop exits) → SetApplyCallback(nil). After wg.Wait(), the consumer loop has exited. SetApplyCallback(nil) clears the reference. If FSM.Apply fires concurrently (raft is still up), the buffered select { default } pattern handles full-channel or nil-channel cases safely via the mutex. Correct.

ResignLeader goroutine: The inner goroutine (done <- n.leadershipTransfer()) may outlive ResignLeader when the context expires. The done channel has capacity 1, so the goroutine will drain and exit without leaking. The transfer itself is bounded by ElectionTimeout. This is explicitly called out in the PR and is acceptable.

waitForMsgsLanded one-shot timer fix: The switch from a repeating deadline ticker to time.NewTimer eliminates the race where simultaneous tick and deadline fires could cause a spurious timeout even when AppliedIndex >= CommitIndex. The final check in the deadline.C branch is correct.


Testing

  • Table-driven tests in node_test.go, config_test.go, and full_node_test.go follow the repo guidelines.
  • The "abdicate when store significantly behind raft" scenario using time.After(200ms) as a negative assertion window (line 272) is better than the previous time.Sleep(10ms) approach. Still relies on wall-clock, but the 200ms window is wide enough to be reliable.
  • Test coverage for pkg/raft/node.go is at 39.62% per Codecov — the main uncovered paths are in Stop(), leadershipTransfer error handling, and the BoltDB store creation. These are hard to unit-test without a real raft cluster, but an integration-level test for the stop sequence would catch the waitForMsgsLanded panic vector above.

Positive Observations

  • The syncResult enum cleanly disambiguates the three false cases from the previous boolean return — a significant correctness improvement.
  • ShutdownTimeout clamping in NewNode (clamp on construction, not at call sites) correctly handles zero-valued test configs.
  • RecoverFromRaft hash verification (currentState.LastBlockHeight > raftState.Height branch) is a solid defensive check — the explicit fork-detection error message will be invaluable for post-mortem.
  • The hot-potato fix (waitForBlockStoreSync giving the fastest-syncing peer time to stabilise) is well-reasoned and matches the described failure mode exactly.
  • buildRaftConfig hclog bridge is the right approach for structured log consistency.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 7, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build: ✅ passed · Format: ⏩ skipped · Lint: ✅ passed · Breaking: ✅ passed · Updated (UTC): Apr 20, 2026, 8:47 AM

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
block/internal/syncing/raft_retriever_test.go (1)

42-61: Consider converting this to a table-driven test.

The current case is good, but a table shape will make it easier to add stop idempotency and start/stop-cycle variants without duplicating setup.

As per coding guidelines "Use table-driven tests in Go unit tests".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@block/internal/syncing/raft_retriever_test.go` around lines 42 - 61, The test
TestRaftRetrieverStopClearsApplyCallback should be converted into a table-driven
test to cover multiple scenarios (current stop behavior, stop idempotency,
start/stop cycles) without duplicating setup: create a slice of test cases each
with a name and a sequence of actions (e.g., start, stop, stop again, start/stop
cycle), and in the t.Run loop instantiate a fresh stubRaftNode and retriever via
newRaftRetriever, call retriever.Start and retriever.Stop according to the case,
then assert expected recordedCallbacks via stubRaftNode.recordedCallbacks; keep
using require.NoError for Start and require assertions on callback length and
nil/non-nil entries as in the original test. Ensure each case isolates state by
creating new retriever and stubRaftNode within the loop.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@block/internal/syncing/raft_retriever.go`:
- Line 77: The call to r.raftNode.SetApplyCallback(nil) races with FSM.Apply
because Apply reads/sends on applyCh while the raft node may concurrently invoke
the callback; fix by adding a mutex to the raft node to guard access to the
apply callback: protect the callback field and its setter Get/SetApplyCallback
(or SetApplyCallback and any internal invocation sites) with the new mutex so
that FSM.Apply (which reads/sends on applyCh via the callback) cannot see a nil
or changing callback mid-invocation; update the raft node's invocation path that
calls the callback (where Apply is invoked) to acquire the same mutex (or use a
read lock) when reading the callback and release it immediately after obtaining
the pointer, then call the callback outside the lock if needed to avoid
long-held locks.

---

Nitpick comments:
In `@block/internal/syncing/raft_retriever_test.go`:
- Around line 42-61: The test TestRaftRetrieverStopClearsApplyCallback should be
converted into a table-driven test to cover multiple scenarios (current stop
behavior, stop idempotency, start/stop cycles) without duplicating setup: create
a slice of test cases each with a name and a sequence of actions (e.g., start,
stop, stop again, start/stop cycle), and in the t.Run loop instantiate a fresh
stubRaftNode and retriever via newRaftRetriever, call retriever.Start and
retriever.Stop according to the case, then assert expected recordedCallbacks via
stubRaftNode.recordedCallbacks; keep using require.NoError for Start and require
assertions on callback length and nil/non-nil entries as in the original test.
Ensure each case isolates state by creating new retriever and stubRaftNode
within the loop.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d9bc4987-af22-4eb8-a04c-1c9ef989e55a

📥 Commits

Reviewing files that changed from the base of the PR and between 04c9cad and 2d28b20.

📒 Files selected for processing (3)
  • block/internal/syncing/raft_retriever.go
  • block/internal/syncing/raft_retriever_test.go
  • pkg/raft/node.go

Comment thread block/internal/syncing/raft_retriever.go
@auricom auricom marked this pull request as draft April 7, 2026 15:24
@codecov

codecov bot commented Apr 8, 2026

Codecov Report

❌ Patch coverage is 58.94040% with 62 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.48%. Comparing base (4c7323f) to head (a434761).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
pkg/raft/node.go 39.62% 32 Missing ⚠️
pkg/raft/election.go 69.23% 12 Missing and 4 partials ⚠️
pkg/cmd/run_node.go 0.00% 10 Missing ⚠️
node/full.go 50.00% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3230      +/-   ##
==========================================
+ Coverage   62.33%   62.48%   +0.15%     
==========================================
  Files         122      122              
  Lines       12877    13012     +135     
==========================================
+ Hits         8027     8131     +104     
- Misses       3970     3995      +25     
- Partials      880      886       +6     
Flag Coverage Δ
combined 62.48% <58.94%> (+0.15%) ⬆️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

auricom and others added 2 commits April 8, 2026 17:03
Bug A: RecoverFromRaft crashed with "invalid block height" when the node
restarted after SIGTERM and the EVM state (persisted before kill) was ahead
of the raft FSM snapshot (which hadn't finished log replay yet). The function
now verifies the hash of the local block at raftState.Height — if it matches
the snapshot hash the EVM history is correct and recovery is safely skipped;
a mismatch returns an error indicating a genuine fork.

Bug B: waitForMsgsLanded used two repeating tickers with the same effective
period (SendTimeout/2 poll, SendTimeout timeout), so both could fire
simultaneously in select and the timeout would win even when AppliedIndex >=
CommitIndex. Replaced the deadline ticker with a one-shot time.NewTimer,
added a final check in the deadline branch, and reduced poll interval to
min(50ms, timeout/4) for more responsive detection.

Fixes the crash-restart Docker backoff loop observed in SIGTERM HA test
cycle 7 (poc-ha-2 never rejoining within the 300s kill interval).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
SetApplyCallback(nil) called from raftRetriever.Stop() raced with
FSM.Apply reading applyCh: wg.Wait() only ensures the consumer goroutine
exited, but the raft library can still invoke Apply concurrently.

Add applyMu sync.RWMutex to FSM; take write lock in SetApplyCallback and
read lock in Apply before reading the channel pointer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@auricom auricom changed the title from "Fix raft leader re-election delays after SIGTERM" to "fix: raft leader re-election delays after SIGTERM" Apr 8, 2026
auricom and others added 6 commits April 9, 2026 15:14
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…GTERM

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… to RaftConfig; fix SnapCount default 0→3

Add three new Raft config parameters:
  - ElectionTimeout: timeout for candidate to wait for votes (defaults to 1s)
  - SnapshotThreshold: outstanding log entries that trigger snapshot (defaults to 500)
  - TrailingLogs: log entries to retain after snapshot (defaults to 200)

Fix critical default: SnapCount was 0 (broken, retains no snapshots) → 3

This enables control over Raft's snapshot frequency and recovery behavior to prevent
resync debt from accumulating unbounded during normal operation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nto hashicorp/raft config

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…r block provenance audit

Add Term field to RaftApplyMsg struct to track the raft term in which each
block was committed. Update FSM.Apply() debug logging to include both
raft_term and raft_index fields alongside block height and hash. This
enables better audit trails and debugging of replication issues.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@auricom auricom changed the title from "fix: raft leader re-election delays after SIGTERM" to "fix: raft HA production hardening — leader fencing, log compaction, election timeout, audit log" Apr 9, 2026
auricom and others added 6 commits April 9, 2026 16:01
…gering tests

The gci formatter requires single space before inline comments (not aligned
double-space). Also removed TestNodeResignLeader_NotLeaderNoop and
TestNewNode_SnapshotConfigApplied which create real boltdb-backed raft nodes:
boltdb@v1.3.1 has an unsafe pointer alignment issue that panics under
Go 1.25's -checkptr. The nil-receiver test (TestNodeResignLeader_NilNoop)
is retained as it exercises the same guard without touching boltdb.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
hashicorp/raft-boltdb uses defer tx.Rollback() as a safety net on every
transaction. When Commit() succeeds, the deferred Rollback() returns
bolt.ErrTxClosed and raft-boltdb logs it as an error — even though it is
the expected outcome of every successful read or write. The message has no
actionable meaning and floods logs at high block rates.

Add a one-time stdlib log filter (sync.Once in NewNode) that silently drops
lines containing 'tx closed' and forwards everything else to stderr.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ocs, tests

- Call raftRetriever.Stop() in Syncer.Stop() so SetApplyCallback(nil) is
  actually reached and the goroutine is awaited before wg.Wait()
- Log leadershipTransfer error at warn level in Node.Stop() instead of
  discarding it silently
- Fix SnapCount comments in config.go: it retains snapshot files on disk
  (NewFileSnapshotStore retain param), not log-entry frequency
- Extract buildRaftConfig helper from NewNode to enable config wiring tests
- Add TestNodeResignLeader_NotLeaderNoop (non-nil node, nil raft → noop)
- Add TestNewNode_SnapshotConfigApplied (table-driven, verifies
  SnapshotThreshold and TrailingLogs wiring with custom and zero values)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…, election validation

- Add ShutdownTimeout field (default 5s) to raft Config so Stop() drains
  committed logs with a meaningful timeout instead of the 200ms SendTimeout
- Wrap ResignLeader() in a 3s goroutine+select fence in the SIGTERM handler
  so a hung leadership transfer cannot block graceful shutdown indefinitely
- Validate ElectionTimeout >= HeartbeatTimeout in RaftConfig.Validate() to
  prevent hashicorp/raft panicking at startup with an invalid config
- Fix stale "NewNode creates a new raft node" comment that had migrated onto
  buildRaftConfig after the function was extracted

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
gofmt/gci requires minimal alignment; excessive spaces in the
TestNewNode_SnapshotConfigApplied struct literal caused a lint failure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add unit tests for lines flagged by Codecov:
- boltTxClosedFilter.Write: filter drops "tx closed", forwards others
- buildRaftConfig: ElectionTimeout > 0 applied, zero uses default
- FullNode.ResignLeader: nil raftNode no-op; non-leader raftNode no-op
- Syncer.Stop: raftRetriever.Stop is called when raftRetriever is set
- Syncer.RecoverFromRaft: GetHeader failure when local state is ahead of
  stale raft snapshot returns "cannot verify hash" error

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@auricom auricom marked this pull request as ready for review April 10, 2026 16:24
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
node/full_node_test.go (1)

87-97: Prefer a table-driven test for the ResignLeader no-op cases.

These two tests are the same shape and are easier to extend/maintain as one table-driven test.

♻️ Proposed refactor
-func TestFullNode_ResignLeader_NilRaftNode(t *testing.T) {
-	n := &FullNode{} // raftNode is nil
-	assert.NoError(t, n.ResignLeader())
-}
-
-func TestFullNode_ResignLeader_NonLeaderRaftNode(t *testing.T) {
-	// Empty *raftpkg.Node has nil raft field so IsLeader() returns false;
-	// ResignLeader() is a no-op and returns nil.
-	n := &FullNode{raftNode: &raftpkg.Node{}}
-	assert.NoError(t, n.ResignLeader())
-}
+func TestFullNode_ResignLeader_NoOpCases(t *testing.T) {
+	tests := []struct {
+		name string
+		node *FullNode
+	}{
+		{name: "nil raft node", node: &FullNode{}},
+		{name: "non-leader raft node", node: &FullNode{raftNode: &raftpkg.Node{}}},
+	}
+
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			assert.NoError(t, tt.node.ResignLeader())
+		})
+	}
+}

As per coding guidelines, **/*_test.go: "Use table-driven tests in Go unit tests".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@node/full_node_test.go` around lines 87 - 97, Combine the two ResignLeader
no-op tests into a single table-driven test: create a slice of cases with
descriptive names for the nil-raftNode case and the non-leader raftNode case,
each providing a FullNode value (e.g., &FullNode{} and &FullNode{raftNode:
&raftpkg.Node{}}) and the expected outcome (no error); iterate over cases using
t.Run(case.name, func(t *testing.T) { assert.NoError(t,
case.node.ResignLeader()) }), keeping references to FullNode.ResignLeader and
raftpkg.Node to locate the code to change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/cmd/run_node.go`:
- Around line 231-245: The current shutdown uses a 3s resignCtx but launches
resigner.ResignLeader() without cancellation so the resign can keep running;
change the API to accept a context (e.g., LeaderResigner.ResignLeader(ctx
context.Context) error), call resigner.ResignLeader(resignCtx) directly (or in a
goroutine and select on its returned error and resignCtx.Done()) and
propagate/observe the context to abort the resign operation; update the call
site in run_node.go to pass resignCtx and handle the returned error or context
timeout to avoid leaking a background goroutine or leaving an unbounded resign
in progress.

In `@pkg/config/config.go`:
- Around line 454-456: The validation must reject negative election timeouts:
add an explicit check in the same validation block to treat c.ElectionTimeout <
0 as invalid (join an error like "election timeout (%v) must be >= 0" into
multiErr), keeping 0 as the "use default" sentinel; ensure this complements the
existing comparison with c.HeartbeatTimeout and aligns with buildRaftConfig's
behavior (which only applies values > 0) so negative values fail fast instead of
being silently ignored.

In `@pkg/raft/node.go`:
- Around line 95-97: The code currently calls
log.SetOutput(&boltTxClosedFilter{w: os.Stderr}) inside suppressBoltNoise.Do(),
which clobbers the process-wide stdlib logger; instead capture the current
stdlib logger writer and wrap it so existing configuration is preserved: obtain
prev := log.Writer() and call log.SetOutput(&boltTxClosedFilter{w: prev}) (still
guarded by suppressBoltNoise.Do()). Update the usage in the same block where
suppressBoltNoise.Do and boltTxClosedFilter are referenced (e.g., in NewNode
initialization) so you no longer hard-code os.Stderr.


📥 Commits

Reviewing files that changed from the base of the PR and between 2d28b20 and a2a2599.

📒 Files selected for processing (12)
  • block/internal/syncing/syncer.go
  • block/internal/syncing/syncer_test.go
  • node/full.go
  • node/full_node_test.go
  • node/node.go
  • pkg/cmd/run_node.go
  • pkg/config/config.go
  • pkg/config/config_test.go
  • pkg/config/defaults.go
  • pkg/raft/node.go
  • pkg/raft/node_test.go
  • pkg/raft/types.go

@julienrbrt julienrbrt self-requested a review April 10, 2026 17:20
auricom and others added 2 commits April 10, 2026 19:22
A negative ElectionTimeout was silently ignored (buildRaftConfig only
applies values > 0), allowing a misconfigured node to start with the
library default instead of failing fast.  Add an explicit < 0 check
that returns an error; 0 remains valid as the "use library default"
sentinel.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…x through ResignLeader

- suppressBoltNoise.Do now wraps log.Writer() instead of os.Stderr so any
  existing stdlib logger redirection is preserved rather than clobbered
- ResignLeader now accepts a context.Context: leadershipTransfer runs in a
  goroutine and a select abandons the caller at ctx.Done(), returning
  ctx.Err(); the goroutine itself exits once the inner raft transfer
  completes (bounded by ElectionTimeout)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Member

@julienrbrt julienrbrt left a comment

nice work!

Member

this test does not look necessary

Contributor Author

Done — deleted in bf6bb50. The test was validating an implementation detail (that Stop() calls SetApplyCallback(nil)) rather than observable behaviour. The stubRaftNode helper it defined has been moved to syncer_test.go where it's still needed by TestSyncer_Stop_CallsRaftRetrieverStop.

Comment thread pkg/raft/node.go Outdated
}

func NewNode(cfg *Config, logger zerolog.Logger) (*Node, error) {
suppressBoltNoise.Do(func() {
Member

Let's delete this; it does not use the correct logger anyway.

Contributor Author

Done — deleted in bf6bb50. The global stdlib log.SetOutput redirect was the wrong scope; nothing else in the codebase should be silenced that way.

Comment thread pkg/raft/node.go Outdated
"google.golang.org/protobuf/proto"
)

// suppressBoltNoise redirects the stdlib log output once to drop the
Member

ditto; if we don't need the log output from this, and it uses the stdlib log implementation, we should make sure hashicorp/raft-boltdb is instantiated with zerolog

Contributor Author

Unfortunately hashicorp/raft-boltdb does not support logger injection — its Options struct has no Logger field and all rollback errors are emitted via hardcoded log.Printf calls. There is no way to route those through zerolog without patching the library upstream. The "Rollback failed: tx closed" noise is benign (it fires every time defer tx.Rollback() runs after a successful tx.Commit()), so for now we accept it.

Comment thread pkg/raft/node.go
// Wait for FSM to apply all committed logs before shutdown to prevent state loss.
// This ensures pending raft messages are processed before the node stops.
if err := n.waitForMsgsLanded(n.config.SendTimeout); err != nil {
if err := n.waitForMsgsLanded(n.config.ShutdownTimeout); err != nil {
Member

Do we really need a different ShutdownTimeout vs SendTimeout? That's a lot of timeouts in the config, which is confusing for operators (cometbft has thousands of confusing timeouts)

Contributor Author


Agreed — removed from public config in bf6bb50. ShutdownTimeout is still computed internally as 5 × SendTimeout at the initRaftNode call site (node/full.go), so the drain window scales with the operator's chosen send timeout without exposing another knob. The ShutdownTimeout field stays in pkg/raft.Config as an internal detail.

auricom and others added 4 commits April 15, 2026 12:41
…t state

When a node wins election but its local store is more than 1 block behind
the raft FSM state, RecoverFromRaft cannot replay the gap (it only handles
the single latest block in the raft snapshot). Previously the node would
log "recovery successful" and start leader operations anyway, then stall
block production while the executor repeatedly failed to sync the missing
blocks from a store that did not have them.

Fix: in DynamicLeaderElection.Run, detect diff < -1 at the
follower→leader transition and immediately transfer leadership to a
better-synced peer. diff == -1 is preserved: RecoverFromRaft can apply
exactly one block from the raft snapshot, so that path is unchanged.

Closes #3255

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nTimeout

- Remove stdlib log filter (boltTxClosedFilter / suppressBoltNoise): it
  redirected the global stdlib logger which is the wrong scope. raft-boltdb
  uses log.Printf directly with no Logger option, so the "tx closed" noise
  is now accepted as-is from stderr.

- Wire hashicorp/raft's internal hclog output through zerolog: set
  raft.Config.Logger to an hclog.Logger backed by the zerolog writer so
  all raft-internal messages appear in the structured log stream under
  component=raft-hashicorp.

- Remove ShutdownTimeout from public config: it was a second "how long to
  wait" knob that confused operators. ShutdownTimeout is now computed
  internally as 5 × SendTimeout at the initRaftNode call site.

- Delete TestRaftRetrieverStopClearsApplyCallback: tested an implementation
  detail (Stop clears the apply callback pointer) rather than observable
  behaviour. The stubRaftNode helper it defined is moved to syncer_test.go
  where it is still needed.

- Rename TestNewNode_SnapshotConfigApplied → TestBuildRaftConfig_SnapshotConfigApplied
  to reflect that it tests buildRaftConfig, not NewNode.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…test

go mod tidy promotes github.com/hashicorp/go-hclog from indirect to
direct now that pkg/raft/node.go imports it explicitly.

gci auto-formatted stubRaftNode method stubs in syncer_test.go.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/config/config.go (1)

419-452: ⚠️ Potential issue | 🟠 Major

Validate raft.snap_count here too.

SnapCount is still allowed to be 0, but pkg/raft/node.go passes it straight into raft.NewFileSnapshotStore(...), which will fail during node startup instead of during config validation. Since this is now a user-facing retention knob, reject zero in RaftConfig.Validate() so bad operator input fails fast.

Suggested fix
 func (c RaftConfig) Validate() error {
 	if !c.Enable {
 		return nil
 	}
 	var multiErr error
@@
+	if c.SnapCount == 0 {
+		multiErr = errors.Join(multiErr, fmt.Errorf("snap count must be > 0"))
+	}
+
 	if c.SendTimeout <= 0 {
 		multiErr = errors.Join(multiErr, fmt.Errorf("send timeout must be positive"))
 	}

As per coding guidelines, "Validate all inputs from external sources in Go code".

Inline comments:
In `@pkg/raft/election_test.go`:
- Around line 245-261: The test is nondeterministic due to time.Sleep and
t.Fatal in a goroutine; replace those with channel-based synchronization: in the
testRunnable used as leader (testRunnable.runFn) send a message on a new
leaderStartAttemptCh instead of calling t.Fatal, and have the main test
goroutine wait on that channel to observe the leader start attempt; remove
time.Sleep and instead use the existing leaderCh/fStarted or a new
leaderStartedCh to coordinate the leader transfer (send/receive to trigger state
transitions) and then perform assertions on the captured signals from
leaderStartAttemptCh and leaderStartedCh; update where DynamicLeaderElection is
constructed (leaderFactory/followerFactory) to return the testRunnable that uses
these channels for deterministic coordination.

In `@pkg/raft/election.go`:
- Around line 147-150: The current code swallows failures from
node.leadershipTransfer() by logging and continuing, which can leave the node
stuck; change the behavior so that transfer failures are surfaced instead of
silently continuing: inside the loop where node.leadershipTransfer() is called
(the call shown as d.node.leadershipTransfer() with the logger d.logger),
replace the bare continue on error with logic that returns or propagates tErr to
the caller (or otherwise transitions the state to abort leader startup) so the
failure is visible and handled upstream; ensure the log still records the error
before returning/propagating.

In `@pkg/raft/node.go`:
- Around line 84-90: NewNode currently can leave cfg.ShutdownTimeout as zero
which causes a panic in Stop() -> waitForMsgsLanded when creating a ticker with
interval zero; fix this by initializing/clamping the internal shutdown timeout
in NewNode before the config is stored: if cfg.ShutdownTimeout <= 0 set it to
the internal default (e.g. 5 * cfg.SendTimeout) or a sane minimum, so callers of
Stop() / waitForMsgsLanded always see a positive timeout value.


📥 Commits

Reviewing files that changed from the base of the PR and between 2d3dc8e and bf6bb50.

📒 Files selected for processing (9)
  • block/internal/syncing/syncer_test.go
  • node/full.go
  • pkg/config/config.go
  • pkg/config/config_test.go
  • pkg/config/defaults.go
  • pkg/raft/election.go
  • pkg/raft/election_test.go
  • pkg/raft/node.go
  • pkg/raft/node_test.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/config/config_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • pkg/config/defaults.go
  • node/full.go
  • block/internal/syncing/syncer_test.go

…nsfer error propagation, deterministic test

ShutdownTimeout zero-value panic (critical):
NewNode now clamps ShutdownTimeout to 5*SendTimeout when the caller passes
zero, preventing a panic in time.NewTicker inside waitForMsgsLanded. The
normal path through initRaftNode already sets it explicitly; this guard
protects direct callers (e.g. tests) that omit the field.

Leadership transfer error propagation (major):
When store-lag abdication calls leadershipTransfer() and it fails, the
error is now returned instead of being logged and silently continuing.
Continuing after a failed transfer left the node as leader-without-worker,
stalling the cluster.

Deterministic abdication test (major):
Replace time.Sleep(10ms) + t.Fatal-in-goroutine with channel-based
synchronization: leader runFn signals leaderStarted; the test goroutine
waits up to 50ms for that signal and calls t.Error (safe from goroutines)
if it arrives, then cancels the context either way.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Member

@julienrbrt julienrbrt left a comment

the changes look great, could you add a changelog?

auricom added a commit that referenced this pull request Apr 15, 2026
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
auricom added a commit that referenced this pull request Apr 15, 2026
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@auricom auricom force-pushed the fix/3229-raft-re-election branch from cd82967 to e161e41 April 15, 2026 14:18
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@auricom auricom force-pushed the fix/3229-raft-re-election branch from e161e41 to 84ec0d0 April 15, 2026 14:19
Comment thread pkg/raft/election.go
if err != nil {
return err
}
if diff < -1 {
Member

this may be too eager: if all nodes in the raft are behind, they will just pass the raft leadership to each other without having time to catch up themselves.

Contributor Author

@auricom auricom Apr 17, 2026

Confirmed this exact scenario in the local tests. When all nodes restart simultaneously, their block stores lag the raft FSM height (block data arrives via p2p, not raft). When all nodes were restarted at once, the raft cluster entered a hot-potato loop with no stable leader and no block production.

Fixed in 5897ef3: before abdicating, the new waitForBlockStoreSync helper polls IsSynced for up to ShutdownTimeout (~1s). The fastest-syncing peer stabilises as leader; nodes that can't catch up in time still abdicate normally. Lost-leadership is also detected mid-wait so the wait aborts early if another node takes over.

@auricom auricom marked this pull request as draft April 16, 2026 12:42
…tion

When all nodes restart simultaneously their block stores can lag behind
the raft FSM height (block data arrives via p2p, not raft). With the
previous code every elected node saw diff < -1 and immediately called
leadershipTransfer(), creating an infinite hot-potato: no node ever
stabilised as leader and block production stalled.

Instead of abdicating immediately, the new waitForBlockStoreSync helper
polls IsSynced for up to ShutdownTimeout (default ~1s). The fastest-
syncing peer proceeds as leader; nodes that cannot catch up in time still
abdicate and yield to a better candidate. Leadership also checks mid-wait
so a lost-leadership event aborts the wait early.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@auricom auricom marked this pull request as ready for review April 20, 2026 08:22
@auricom auricom enabled auto-merge April 20, 2026 08:22
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/raft/election.go (1)

310-322: Minor: the final deadline check swallows IsSynced error; poll interval assumes non-zero SendTimeout.

  • Line 312–313: if IsSynced returns an error on the final post-deadline check, it's dropped silently and mapped to "not synced". Logging it at debug/warn would aid post-mortem of abdications.
  • Line 299 / 303: if cfg.ShutdownTimeout <= 0 and cfg.SendTimeout is ever 0 (e.g., a test Config{} that bypasses Validate()), timeout stays 0 and time.NewTicker(timeout/10) panics with a non-positive interval. A small floor (if timeout <= 0 { timeout = defaultShutdownTimeout } or max(timeout, someMin)) would make the helper robust against zero-valued configs used in tests/mocks.
Suggested tweak
 	timeout := cfg.ShutdownTimeout
 	if timeout <= 0 {
 		timeout = 5 * cfg.SendTimeout
 	}
+	if timeout <= 0 {
+		timeout = 5 * time.Second
+	}
 	deadline := time.NewTimer(timeout)
 	defer deadline.Stop()
-	pollInterval := min(100*time.Millisecond, timeout/10)
+	pollInterval := max(time.Millisecond, min(100*time.Millisecond, timeout/10))
 		case <-deadline.C:
 			// Final check before giving up.
 			diff, err := r.IsSynced(d.node.GetState())
+			if err != nil {
+				d.logger.Warn().Err(err).Msg("final IsSynced check failed after block-store sync wait")
+			}
 			return err == nil && diff >= -1
Inline comments:
In `@pkg/raft/election.go`:
- Around line 150-169: waitForBlockStoreSync currently returns false for three
distinct reasons (canceled ctx, timeout, or lost leadership) but the caller in
Run unconditionally calls d.node.leadershipTransfer() on false; change
waitForBlockStoreSync to return a small sentinel/enum (e.g., syncResultSynced,
syncResultTimeout, syncResultLostLeadership, syncResultCanceled) and adjust Run:
if syncResultLostLeadership -> simply continue (do not call
d.node.leadershipTransfer()), if syncResultTimeout -> attempt
d.node.leadershipTransfer() and handle/errors as before, if syncResultCanceled
-> return ctx.Err(), and if syncResultSynced -> refresh raftState and proceed;
update any references to NodeID/leaderID checks inside waitForBlockStoreSync
accordingly.


📥 Commits

Reviewing files that changed from the base of the PR and between bf6bb50 and 03e33c3.

📒 Files selected for processing (7)
  • CHANGELOG.md
  • block/internal/syncing/syncer.go
  • block/internal/syncing/syncer_test.go
  • go.mod
  • pkg/raft/election.go
  • pkg/raft/election_test.go
  • pkg/raft/node.go
✅ Files skipped from review due to trivial changes (1)
  • CHANGELOG.md
🚧 Files skipped from review as they are similar to previous changes (4)
  • block/internal/syncing/syncer.go
  • block/internal/syncing/syncer_test.go
  • pkg/raft/node.go
  • pkg/raft/election_test.go

auricom and others added 2 commits April 20, 2026 10:42
waitForBlockStoreSync previously returned bool, conflating three distinct
failure modes (ctx canceled, timeout, lost leadership). The caller in Run
then unconditionally called leadershipTransfer() on any false return, which
is wrong when leadership was already lost.

Introduce a syncResult enum (syncResultSynced, syncResultTimeout,
syncResultLostLeadership, syncResultCanceled) and update Run to handle each
case correctly:
- syncResultCanceled    → return ctx.Err()
- syncResultLostLeadership → continue without calling leadershipTransfer()
- syncResultTimeout     → leadershipTransfer() + continue as before
- syncResultSynced      → refresh raftState/diff and proceed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Member

@julienrbrt julienrbrt left a comment

utACK

@auricom auricom added this pull request to the merge queue Apr 20, 2026
Merged via the queue into main with commit d9046e6 Apr 20, 2026
48 of 51 checks passed
@auricom auricom deleted the fix/3229-raft-re-election branch April 20, 2026 09:50
@auricom auricom restored the fix/3229-raft-re-election branch April 20, 2026 09:52